In this exercise, we will require the tidyverse, knitr, janitor and gt packages.
library(tidyverse)
library(knitr)
library(janitor)
library(gt)
The goal of this exercise is to try some of the data manipulation
options available in R. The data we are looking at today are songs
downloaded from Spotify via the spotifyr package. You can
even use this package to download your own playlists!
The file is called spotify_songs.csv and you can read the file in using the read_csv() function. Choose songs as the name of the data frame if you want to be consistent with the rest of the exercise and the solutions we provide. Explore the variables in RStudio or using code. A detailed summary of each of the variables can be found here: https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md.
songs <- read_csv("spotify_songs.csv")
Rows: 32833 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(songs)
Rows: 32,833
Columns: 23
$ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
$ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
$ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
$ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
$ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
$ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
$ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
$ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
$ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
$ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
$ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance…
$ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
$ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
$ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
$ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
$ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
$ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
$ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
$ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
$ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
$ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
$ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
$ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 16304…
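Beyond glimpse(), a couple of quick summaries can help when exploring the variables using code. The sketch below uses a tiny made-up data frame (toy_songs and its values are hypothetical, chosen only to mirror a few column names in songs); the same calls should work on songs itself.

```r
library(dplyr)

# Hypothetical stand-in for songs: made-up values, real column names
toy_songs <- tibble(
  track_name     = c("A", "B", "C", "D"),
  playlist_genre = c("pop", "pop", "rap", "rock"),
  danceability   = c(0.75, 0.60, 0.92, 0.40)
)

# How many rows fall in each genre, most common first?
genre_counts <- toy_songs %>%
  count(playlist_genre, sort = TRUE)
genre_counts

# Minimum, quartiles, mean and maximum of a numeric variable
summary(toy_songs$danceability)
```

Using count() on a character variable and summary() on a numeric one is a quick way to spot unexpected categories or values before transforming the data.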
Remove the variable (playlist_id) from the data set. Use the select() function to remove the variable. This requires you to use a negative sign. Give this new data frame a new name.
songs2 <- songs %>%
select(-playlist_id)
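To see what the negative sign does, here is a minimal sketch on a made-up data frame (toy_songs and its values are hypothetical). It also shows that saving the result under a new name leaves the original data frame untouched.

```r
library(dplyr)

# Hypothetical stand-in for songs
toy_songs <- tibble(
  track_name   = c("A", "B"),
  playlist_id  = c("id1", "id2"),
  danceability = c(0.75, 0.60)
)

# A minus sign in front of a column name drops that column
toy_songs2 <- toy_songs %>%
  select(-playlist_id)

names(toy_songs2)  # playlist_id is no longer listed
names(toy_songs)   # the original still has all three columns
```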
Use the rename() function to rename the variable mode so that it indicates that it is a mode of the key in particular. Create a new code chunk that includes the code from (b) as well, so that it can all be run together as one chunk.
songs2 <- songs %>%
select(-playlist_id) %>%
rename(key_mode = mode)
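The argument order in rename() is easy to mix up: the new name goes on the left and the existing name on the right. A small sketch on made-up data (toy_songs is hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for songs
toy_songs <- tibble(key = c(6, 11), mode = c(1, 0))

# rename() takes new_name = old_name
toy_renamed <- toy_songs %>%
  rename(key_mode = mode)

names(toy_renamed)
```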
The danceability variable is currently a number from 0 to 1. Use the mutate() function to create a new percentage danceability variable from 0 to 100. Include all the code in one chunk again, as you continue building on it.
songs2 <- songs %>%
select(-playlist_id) %>%
rename(key_mode = mode) %>%
mutate(dance100 = danceability*100)
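After creating a new variable, it is worth checking that it is on the scale you expect. A minimal sketch on made-up values (toy_songs is hypothetical):

```r
library(dplyr)

# Hypothetical danceability values on the original 0 to 1 scale
toy_songs <- tibble(danceability = c(0, 0.405, 0.983))

toy_songs2 <- toy_songs %>%
  mutate(dance100 = danceability * 100)

# mutate() keeps the original column and adds the new one at the end
names(toy_songs2)

# The new variable should lie between 0 and 100
range(toy_songs2$dance100)
```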
Use the arrange() function to order the data frame on your new danceability scale. Recall that desc() within arrange() uses descending order. What are some of the top songs for danceability?
Add the existing code to this chunk as well, as above.
songs2 <- songs %>%
select(-playlist_id) %>%
rename(key_mode = mode) %>%
mutate(dance100 = danceability*100) %>%
arrange(desc(dance100))
songs2
# A tibble: 32,833 × 23
track_id track…¹ track…² track…³ track…⁴ track…⁵ track…⁶ playl…⁷ playl…⁸
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 0U7sbXtHiRvv… If Onl… Fusion… 41 0QOi08… If Onl… 2016-0… House/… edm
2 4uHMfLdd3IbY… Mega R… DJ Zsu… 27 2n9oVY… Mega R… 2018-0… Electr… edm
3 3XVozq1aeqsJ… Ice Ic… Vanill… 70 20O6lf… Vanill… 2008-1… 90s Da… pop
4 4a5nDDqiQX6a… Enseña… DJ Goo… 65 2RMlIj… Enseña… 2019-0… Verano… latin
5 3Xv5C02Wxlek… Cha Ch… DJ Cas… 54 3Ogg26… Cha Ch… 2004-0… School… latin
6 4dA9s3ai98TH… Slow D… India.… 27 5gnsCH… Voyage… 2002-0… Neo So… r&b
7 01IQ4aQgOf0K… Funky … Dave 72 3CFVTs… Funky … 2018-1… Rap Wo… rap
8 5vOLEEbDyprZ… Get Do… Centra… 5 13MEhm… Underw… 2004-0… Chican… latin
9 1GeNui6m825V… Bad Ba… Young … 81 1bnHPO… So Muc… 2019-0… Hip-Ho… rap
10 4TJ56OkWrnf2… In da … Trick … 48 4uHDWJ… Thug H… 2002-0… Southe… rap
# … with 32,823 more rows, 14 more variables: playlist_subgenre <chr>,
# danceability <dbl>, energy <dbl>, key <dbl>, loudness <dbl>,
# key_mode <dbl>, speechiness <dbl>, acousticness <dbl>,
# instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
# duration_ms <dbl>, dance100 <dbl>, and abbreviated variable names
# ¹track_name, ²track_artist, ³track_popularity, ⁴track_album_id,
# ⁵track_album_name, ⁶track_album_release_date, ⁷playlist_name, …
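The printed tibble abbreviates most columns, which makes the top songs hard to read. One option, sketched here on made-up data (toy_songs2 and its values are hypothetical), is to keep only a few columns and the first few rows after sorting.

```r
library(dplyr)

# Hypothetical stand-in for songs2
toy_songs2 <- tibble(
  track_name   = c("A", "B", "C", "D"),
  track_artist = c("W", "X", "Y", "Z"),
  dance100     = c(75, 98, 60, 91)
)

# Sort from most to least danceable, then keep a readable subset
top_dance <- toy_songs2 %>%
  arrange(desc(dance100)) %>%
  select(track_name, track_artist, dance100) %>%
  slice_head(n = 3)

top_dance
```

Because the result is saved under a different name, the full sorted data frame is not overwritten.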
Use the filter() function to select songs with danceability greater than or equal to 90%. How many songs are in this category? Use tabyl() to see how many highly danceable songs are in each genre. Be careful not to overwrite the original data set with this table. There is no need to assign this tabyl to an object unless you intend to refer to it later.
It would be useful to know how the proportions of each genre compare to the entire set of songs as well. So instead of filtering by high danceability, try creating a binary variable that indicates high danceability (90 or more) compared with the alternative, using the mutate() function. You may like to save this version of the data set with a new name.
songs %>%
select(-playlist_id) %>%
rename(key_mode = mode) %>%
mutate(dance100 = danceability*100) %>%
arrange(desc(dance100)) %>%
filter(dance100>=90) %>%
tabyl(playlist_genre) %>%
adorn_pct_formatting() %>%
gt()
| playlist_genre | n | percent |
|---|---|---|
| edm | 93 | 12.5% |
| latin | 99 | 13.3% |
| pop | 47 | 6.3% |
| r&b | 120 | 16.1% |
| rap | 368 | 49.5% |
| rock | 17 | 2.3% |
There are 744 songs in this category, and almost half of them (368, or 49.5%) are rap songs.
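The number of songs in the category can also be obtained directly with nrow() after filtering. The sketch below uses made-up data (toy_songs2 is hypothetical) and dplyr's count() in place of tabyl(), purely to show how the percentages in such a table arise.

```r
library(dplyr)

# Hypothetical stand-in for songs2
toy_songs2 <- tibble(
  playlist_genre = c("rap", "rap", "pop", "edm", "pop"),
  dance100       = c(95, 90, 92, 50, 89.9)
)

# Keep only the highly danceable songs (90 or more)
high_dance <- toy_songs2 %>%
  filter(dance100 >= 90)

# How many songs are in this category?
n_high <- nrow(high_dance)
n_high

# Count per genre, with each genre's share of the highly danceable songs
genre_share <- high_dance %>%
  count(playlist_genre) %>%
  mutate(percent = 100 * n / sum(n))

genre_share
```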
songs2 <- songs %>%
select(-playlist_id) %>%
rename(key_mode = mode) %>%
mutate(dance100 = danceability*100) %>%
arrange(desc(dance100)) %>%
mutate(highdance = dance100>=90) %>%
mutate(highdance = replace(highdance, highdance == FALSE, "low")) %>%
mutate(highdance = replace(highdance, highdance == TRUE, "high"))
songs2 %>%
tabyl(playlist_genre, highdance) %>%
gt()
| playlist_genre | high | low |
|---|---|---|
| edm | 93 | 5950 |
| latin | 99 | 5056 |
| pop | 47 | 5460 |
| r&b | 120 | 5311 |
| rap | 368 | 5378 |
| rock | 17 | 4934 |
The replace() function is used within mutate() to give this new variable more informative level names (rather than the default TRUE or FALSE). From here we can see that not only are there a large number of rap songs with high danceability; the highest percentage of high danceability songs is also found in the rap genre (368 of 5,746 rap songs, about 6.4%).
Note also that we separate the data manipulation step from the
tabyl step, so that the updated data (and not the table)
are saved with the name songs2, which we can refer to
below.
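As an alternative to the replace() approach used in the solution, the labels can be assigned in a single step with dplyr's if_else(). The sketch below uses made-up data (toy_songs is hypothetical) and also computes the percentage of high danceability songs within each genre, which is the comparison described above. It uses group_by() and summarise() rather than tabyl().

```r
library(dplyr)

# Hypothetical stand-in for songs with the dance100 variable added
toy_songs <- tibble(
  playlist_genre = c("rap", "rap", "pop", "pop", "pop"),
  dance100       = c(95, 70, 91, 40, 55)
)

# Label each song directly as "high" or "low"
toy_songs2 <- toy_songs %>%
  mutate(highdance = if_else(dance100 >= 90, "high", "low"))

# Percentage of high danceability songs within each genre
genre_pct <- toy_songs2 %>%
  group_by(playlist_genre) %>%
  summarise(
    n_high   = sum(highdance == "high"),
    pct_high = 100 * mean(highdance == "high")
  )

genre_pct
```

As in the solution, the data manipulation is saved separately from the summary, so toy_songs2 still holds the data rather than the table.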
Consider the relationship between popularity (track_popularity) and danceability for the full data set. What visual display could you use to explore this relationship?
Hint: there are a large number of data points, so there will be a lot of overlap in the points. There are a few techniques that can help. For example, you can add additional information with a geom_smooth() curve. You can also summarise the points in each location with something like geom_hex() (requiring the package hexbin), or make the points transparent (geom_point(alpha = 0.1)) or very small (geom_point(shape = ".")). Play around with these options to see what you think works best. Don’t be surprised if these figures take some time to generate, as there are a large number of data points in each.
How might you explore this relationship for different genres (playlist_genre)?
ggplot(songs2,
aes(x = dance100,
y = track_popularity)) +
geom_point()
ggplot(songs2,
aes(x = dance100,
y = track_popularity)) +
geom_point() + geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(songs2,
aes(x = dance100,
y = track_popularity)) +
geom_point(alpha = 0.1)
ggplot(songs2,
aes(x = dance100,
y = track_popularity)) +
geom_point(shape = ".")
library(hexbin)
ggplot(songs2,
aes(x = dance100,
y = track_popularity)) +
geom_hex()
Here are a range of options. For the genre figure, let’s add colour and
smoothed lines by genre.
ggplot(songs2,
aes(x = dance100,
y = track_popularity,
colour = playlist_genre)) +
geom_point() + geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
It is still a bit hard to see what is happening with all the genres on the same plot, so we will also try panelling by genre and add some transparency as well.
ggplot(songs2,
aes(x = dance100,
y = track_popularity)) +
geom_point(alpha = 0.1) + geom_smooth() +
facet_wrap(vars(playlist_genre), nrow=3)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
In general, there is no evidence of a strong relationship between danceability and popularity of a track, though there are some slight positive trends (for example, for rap).
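A numerical summary can complement the plots. One possibility, sketched here on made-up data (toy_songs2 and its values are hypothetical and say nothing about the real songs), is the correlation between danceability and popularity within each genre. Note that cor() measures linear association only, so it should be read alongside the smoothed curves and not instead of them.

```r
library(dplyr)

# Hypothetical stand-in for songs2
toy_songs2 <- tibble(
  playlist_genre   = rep(c("pop", "rap"), each = 4),
  dance100         = c(40, 55, 70, 85, 50, 60, 80, 95),
  track_popularity = c(30, 60, 45, 50, 20, 35, 50, 70)
)

# Pearson correlation between the two variables, one row per genre
genre_cor <- toy_songs2 %>%
  group_by(playlist_genre) %>%
  summarise(correlation = cor(dance100, track_popularity))

genre_cor
```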
Consider some other relationships between variables that you are interested in and create the code to explore these, with summary tables and/or visual displays.
Download the airport screening file used in lectures and perform some of your own data transformations and summaries of these variables.
© 2022 Statistical Consulting Centre, The University of Melbourne.